It is tempting to treat frequency trends from the Google Books data sets as indicators of the 
“true” popularity of various words and phrases. Doing so allows us to draw quantitatively strong 
conclusions about the evolution of cultural perception of a given topic, such as time or gender. 
However, the Google Books corpus suffers from a number of limitations which make it an obscure 
mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing 
one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the 
Google Books lexicon, whether the author is widely read or not. With this understood, the Google 
Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, 
we show that a distinct problematic feature arises from the inclusion of scientific texts, which have 
become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge 
of phrases typical to academic articles but less common in general, such as references to time in the 
form of citations. We highlight these dynamics by examining and comparing major contributions 
to the statistical divergence of English data sets between decades in the period 1800-2000. We find 
that only the English Fiction data set from the second version of the corpus is not heavily affected 
by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered 
English data sets. Our findings emphasize the need to fully characterize the dynamics of the Google 
Books corpus before using these data sets to draw broad conclusions about cultural and linguistic 
evolution. 



I. INTRODUCTION 

The Google Books data set is captivating both for its 
availability and its incredible size. The first version of 
the data set, published in 2009, incorporates over 5 million 
books [1]. These are, in turn, a subset selected for quality 
of optical character recognition and metadata (e.g., dates of 
publication) from 15 million digitized books, largely provided 
by university libraries. These 5 million books contain over 
half a trillion words, 361 billion of which are in English. 
Along with separate data sets for American English, British 
English, and English Fiction, the first version also includes 
Spanish, French, German, Russian, Chinese, and Hebrew data 
sets. The second version, published in 2012 [5], contains 
8 million books with half a trillion words in English alone, 
and also includes books in Italian. The contents of the 
sampled books are split into case-sensitive n-grams, which 
are typically blocks of text separated into n = 1, ..., 5 
pieces by whitespace; e.g., “I” is a 1-gram, and “I am” 
is a 2-gram. 

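The whitespace-based definition of n-grams given above can be illustrated with a minimal sketch. The function names `ngrams` and `ngram_counts` are hypothetical, and the sketch assumes plain whitespace splitting only; the tokenizer actually used to build the corpus is not reproduced here and may treat punctuation differently.

```python
from collections import Counter

def ngrams(text, n):
    """Return the whitespace-delimited n-grams of `text` as strings."""
    tokens = text.split()  # case is preserved, so "Figure" != "figure"
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, max_n=5):
    """Count n-grams for n = 1, ..., max_n."""
    return {n: Counter(ngrams(text, n)) for n in range(1, max_n + 1)}

counts = ngram_counts("I am what I am")
print(counts[1]["I"])     # 2
print(counts[2]["I am"])  # 2
```

Dividing each count by the total number of n-grams of the same length in a given year gives the kind of normalized frequency discussed throughout this paper.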
A central, if subtle and deceptive, feature of the Google 
Books corpus, and of others composed in a similar fashion, 
is that the corpus is a reflection of a library in which 
only one of each book is available. Ideally, we would be 
able to apply different popularity filters to the corpus. 
For example, we could ask to have n-gram frequencies 
adjusted according to book sales in the UK, library usage 
data in the US, or how often each page in each book is 
read on Amazon’s Kindle service (all over defined periods 
of time). Evidently, incorporating popularity in any useful 
fashion would be an extremely difficult undertaking 
on the part of Google. 

We are left with the fact that the Google Books library 
has ultimately been furnished by the efforts and choices 
of authors, editors, and publishing houses, who collectively 
aim to anticipate or dictate what people will 
read. This adds a further distancing from “true culture,” 
as the ability to predict cultural success is often rendered 
fundamentally impossible due to social influence processes 
[3]: we have one seed for each tree but no view of the 
real forest that will emerge. 

We therefore observe that the Google Books corpus 
encodes only a small-scale kind of popularity: how often 
n-grams appear in a library with all books given (in principle) 
equal importance and tied to their year of publication 
(new editions and reprints allow some books to 
appear more than once). The corpus is thus more akin 
to a lexicon for a collection of texts, rather than the collection 
itself. But problematically, because Google Books 
n-grams do have frequency of usage associated with them 
based on this small-scale popularity, the data set readily 
conveys an illusion of large-scale cultural popularity. An 
n-gram which declines in usage frequency over time may 
in fact become more often read by a particular demographic 
focused on a specific genre of books. For example, 
“Frodo” first appears in the second Google Books 
English Fiction corpus in the mid 1950s and declines 
thereafter in popularity with a few resurgent spikes [3]. 

While this limitation to small-scale popularity tempers 
the kinds of conclusions we can draw, the evolution of 
n-grams within the Google Books corpus (their relative 
abundance, their growth and decay) still gives us a valuable 
lens into how language use and culture have changed 
over time. Our contribution here will be to show: 

1. A principled approach for exploring word and 
phrase evolution; 

2. How the Google Books corpus is challenged in other, 
orthogonal respects, particularly by the inclusion 
of scientific and medical journals; and 

3. How future analyses of the Google Books corpus 
should be considered. 

For ease of comparison with related work, we focus 
primarily on 1-grams from selected English data sets 
between the years 1800 and 2000. In this work, we 
will use the terms “word” and “1-gram” interchangeably 
for the sake of convenience. The total volume of (non-unique) 
English 1-grams grows exponentially between 
these years, as shown in Fig. 1, except during major 
conflicts (e.g., the American Civil War and both World 
Wars), when the total volume dips substantially. We 
also observe a slight increase in volume between the first 
and second version of the unfiltered English data set. 
Between the two English Fiction data sets, however, the 
total volume actually decreases considerably, which indicates 
insufficient filtering was used in producing the first 
version, and immediately suggests the initial English Fiction 
data set may not be appropriate for any kind of 
analysis. 

The simplest possible analysis involving any Google 
Books data set is to track the relative frequencies of a 
specific set of words or phrases. Examples of such analyses 
involve words or phrases surrounding individuality, 
gender, urbanization [7], and time, all of 
which are of profound interest. However, the strength of 
all conclusions drawn from these must take into account 
both the number of words and phrases in question (anywhere 
from two to twenty or more at a time) and 
the sampling methods used to build the Google Books 
corpus. 

Many researchers have carried out broad analyses 
of the Google Books corpus, examining properties and 
dynamics of entire languages. These include analyses of 
Zipf’s and Heaps’ laws as applied to the corpus, the 
rates of verb regularization, rates of word introduction 
and obsolescence and durations of cultural memory, as 
well as an observed decrease in the need for new words 
in several languages [10]. However, these studies also 
appear to take for granted that the data sets sample in 
a consistent manner from works spanning the last two 
centuries. 





FIG. 1: The logarithms of the total 1-gram counts for the 
Google Books English data sets (dark gray) and English Fiction 
data sets (light gray), plotted against year. The dashed and 
solid curves denote the 2009 and 2012 versions of the data sets. 
In all four examples, an exponential increase in volume is 
apparent over time with notable exceptions during wartime when 
the total volume decreases, clearest during the American Civil 
War and both World Wars. While the total volume for English 
increases between versions, the volume for English Fiction decreases 
drastically, suggesting a more rigorous filtering process. 

Analysis of the emotional content of books suggests a 
lag of roughly a decade between exogenous events and 
their effects in literature, complicating the use of the 
Google Books data sets directly as snapshots of cultural 
identity. 

As we will demonstrate, an assumption of unbiased 
sampling of books is not reasonable during the last century 
and especially during recent decades, which is of particular 
importance to all analyses concerned with recent 
social change. Since parsing in the data sets is case-sensitive, 
we can give a suggestive illustration of this 
observation in Fig. 2, which displays the relative (normalized) 
frequencies of “figure” versus “Figure” in both 
versions of the corpus and for both English and English 
Fiction. In both versions of the English data set, the capitalized 
version, “Figure,” surpasses its lowercase counterpart 
during the 1960s. Since the majority of books in 
the corpus originated in university libraries, a major 
effect of scientific texts on the dynamics of the data set 
is quite plausible. This trend is also apparent, albeit 
delayed, in the first version of the English Fiction data 
set, which again suggests insufficient filtering during the 
compilation process for that version. 

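The case-sensitive comparison described above can be sketched as follows. The helper names and the counts in `toy` are invented purely for illustration and are not corpus values; the sketch only shows how one might normalize counts per year and locate the year in which “Figure” first overtakes “figure.”

```python
def relative_frequency(counts_by_year, word):
    """Relative frequency of `word` per year: its count over that year's total."""
    return {
        year: counts.get(word, 0) / sum(counts.values())
        for year, counts in sorted(counts_by_year.items())
    }

def first_crossover(series_a, series_b):
    """First year in which series_a exceeds series_b, or None if it never does."""
    for year in sorted(series_a):
        if series_a[year] > series_b[year]:
            return year
    return None

# Toy counts, invented for illustration only (not taken from the corpus).
toy = {
    1950: {"figure": 30, "Figure": 10, "the": 960},
    1960: {"figure": 25, "Figure": 20, "the": 955},
    1970: {"figure": 20, "Figure": 35, "the": 945},
}
capitalized = relative_frequency(toy, "Figure")
lowercase = relative_frequency(toy, "figure")
print(first_crossover(capitalized, lowercase))  # 1970
```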
Because of the library-like nature of Google Books, authors 
are not represented equally or by any measure of popularity 
in any given data set but are instead represented roughly 
by their own prolificacy. This leaves room for individual authors 
to have noteworthy effects on the dynamics of the data 
sets, as we will demonstrate in Section III. 

FIG. 2: Relative frequencies of “Figure” vs “figure” in both 
versions of the Google Books corpus for both English (all) 
and English Fiction. In the English data sets, the capitalized 
term rapidly surpasses the uncapitalized term in the 1960s. 
For the first English Fiction data set, this effect is delayed 
until the 1970s. As shown later, only the second version of 
the English Fiction data set demonstrates a filtering of scientific 
terminology. These trends strongly suggest an increase 
starting around 1900 in the sampling of scientific texts in both 
English data sets and the first English Fiction data set. 

Lastly, due to copyright laws, the public data sets 
do not include metadata (see supporting online material), 
and the data are truncated to avoid inference of 
authorship, which severely limits any analysis of censorship 
in the corpus. Under these conditions, we will 
show that much caution must be used when employing 
these data sets (with a possible exception of the second 
version of English Fiction) to draw cultural conclusions 
from the frequencies of words or phrases in the corpus. 


We structure the remainder of the paper as follows. In 
Sec. II, we describe how to use Jensen-Shannon divergence 
to highlight the dynamics over time of both versions 
of the English and English Fiction data sets, paying 
particular attention to key contributing words. In 
Sec. III, we display and discuss examples of these highlights, 
exploring the extent of the scientific literature bias 
and issues with individual authors; we also provide a 
detailed inspection of some example decade-decade comparisons. 
We offer concluding remarks in Section IV. 


II. METHODS 

A. Statistical divergence between years 

We examine the dynamics of the Google Books corpus 
by calculating the statistical divergence between the 
distributions of 1-grams in two given years. A commonly 
used measure of statistical divergence is Kullback-Leibler 
(KL) divergence [13], based on which we use a bounded, 
symmetric measure. Given a language with N unique 
words and 1-gram distributions P in the first year and Q 
in the second, the KL divergence between P and Q can be 
expressed as 

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i=1}^{N} p_i \log_2 \frac{p_i}{q_i}, \qquad (1)$$

where $p_i$ is the probability of observing the $i$th 1-gram 
when a 1-gram is chosen at random from the distribution for 
the first year, and $q_i$ is the probability of observing the same 
word in the second year. KL divergence is measured in 
bits, and may be interpreted as the average number of 
bits wasted if a text from the first year is encoded efficiently, 
but according to the distribution from the latter, 
incorrect year. To demonstrate this, we may rewrite the 
previous equation as 


$$D_{\mathrm{KL}}(P \,\|\, Q) = -\sum_{i=1}^{N} p_i \log_2 q_i - H(P), \qquad (2)$$

where $H(P) = -\sum_{i=1}^{N} p_i \log_2 p_i$ is the Shannon 
entropy [14], also the average number of bits required per 
word in an efficient encoding for the original distribution; 
and the remaining term is the average number of bits 
required per word in an efficient, but mistaken, encoding 
of a given text. However, if a single (say, the $j$th) 
1-gram in the language exists in the first year, but not 
in the second, then $q_j = 0$, and the divergence diverges. 
Since this scenario is not extraordinary for the data sets 
in question, we instead use Jensen-Shannon divergence 
(JSD) [15] given by 

$$D_{\mathrm{JS}}(P \,\|\, Q) = \frac{1}{2}\left( D_{\mathrm{KL}}(P \,\|\, M) + D_{\mathrm{KL}}(Q \,\|\, M) \right), \qquad (3)$$

where $M = \frac{1}{2}(P + Q)$ is a mixed distribution of the two 
years. This measure of divergence is bounded between 
0 when the distributions are the same and 1 bit in the 
extreme case when there is no overlap between the 1-grams 
in the two distributions. If we begin with a uniform 
distribution of N species and replace k of those 
species with k entirely new ones, the JSD between the 
original and new distribution is $k/N$, the proportion of 
species replaced. The JSD is also symmetric, which is an 
added convenience. The JSD may be expressed as 

$$D_{\mathrm{JS}}(P \,\|\, Q) = H(M) - \frac{1}{2}\left( H(P) + H(Q) \right), \qquad (4)$$

from which it is apparent that a similar waste analogy 
holds as with KL divergence, with the mixed distribution 
taking the place of the approximation regardless of the 
year a text was written. 

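The definitions in this subsection can be checked numerically with a short sketch. It assumes distributions are stored as plain dictionaries mapping words to probabilities; the function names are hypothetical, and the uniform example with N = 10 and k = 3 is chosen only to illustrate the properties stated above (divergent KL, bounded JSD equal to k/N, and agreement between Eqs. (3) and (4)).

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; infinite when some q_i = 0 while p_i > 0."""
    total = 0.0
    for word, p_i in p.items():
        if p_i == 0:
            continue
        q_i = q.get(word, 0.0)
        if q_i == 0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total

def jsd(p, q):
    """Jensen-Shannon divergence in bits, following Eq. (3)."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

# Uniform distribution over N = 10 species; replace k = 3 of them.
N, k = 10, 3
p = {f"s{i}": 1 / N for i in range(N)}
q = {f"s{i}": 1 / N for i in range(k, N + k)}
m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}

print(kl_divergence(p, q))                                      # inf
print(round(jsd(p, q), 12))                                     # 0.3
print(round(entropy(m) - 0.5 * (entropy(p) + entropy(q)), 12))  # 0.3
```

The first printed value shows why KL divergence is unsuitable when a word vanishes between years, while the last two agree with k/N = 0.3 via both forms of the JSD.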

B. Key contributions of individual words 


The form for Jensen-Shannon divergence given in Eq. (4) 
can be broken down into contributions from individual 
words, where the contribution from the $i$th word to the 
divergence between two years is given by 

$$D_{\mathrm{JS},i}(P \,\|\, Q) = -m_i \log_2 m_i + \frac{1}{2}\left( p_i \log_2 p_i + q_i \log_2 q_i \right). \qquad (5)$$

Here $m_i = \frac{1}{2}(p_i + q_i)$ is the probability of the $i$th word in the mixed distribution $M$. 

Some rearrangement gives 

$$D_{\mathrm{JS},i}(P \,\|\, Q) = m_i \cdot \frac{1}{2}\left( r_i \log_2 r_i + (2 - r_i) \log_2 (2 - r_i) \right), \qquad (6)$$

where $r_i = p_i/m_i$, so that the contribution from an individual 
word is proportional to the average probability of the 
word, and the proportion depends on the ratio between 
the smaller probability (without loss of generality) and 
the average. Namely, we may reframe the equation above 
as 

$$D_{\mathrm{JS},i}(P \,\|\, Q) = m_i \, C(r_i). \qquad (7)$$


Words with larger average probability yield greater contributions, 
as do those with smaller ratios, $r_i$, between 
the smaller and average probability. So while a common 
1-gram (such as “the,” “if,” or a period) changing subtly 
can have a large effect on the divergence, so can an 
uncommon (or entirely new) word given a sufficient shift 
from one year to the next. The size of the contribution 
relative to the average probability is displayed in Fig. 3 
for ratios ranging from 0 to 1. $C(r_i)$ is symmetric about 
$r_i = 1$ (i.e., no change), so no novel behavior is lost by 
omitting the case where $r_i > 1$ (i.e., when $p_i$ is the larger 
probability). The maximum possible contribution (in 
bits) is precisely the average probability of the word in 
question, which occurs if and only if the smaller probability 
is 0. No contribution is made if and only if the 
probability remains unchanged. 

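As a hedged illustration of the decomposition above, the following sketch computes per-word contributions in the form $m_i C(r_i)$ and compares their sum with the direct form of Eq. (5). The function names and the toy distributions are hypothetical and chosen only to exercise the limiting cases r = 0 and r = 1.

```python
import math

def xlog2x(x):
    """x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0

def c_of_r(r):
    """C(r) from Eqs. (6) and (7): proportion of the average probability contributed."""
    return 0.5 * (xlog2x(r) + xlog2x(2.0 - r))

def jsd_contributions(p, q):
    """Per-word contributions m_i * C(r_i); their sum is the JSD in bits."""
    contributions = {}
    for w in set(p) | set(q):
        p_i, q_i = p.get(w, 0.0), q.get(w, 0.0)
        m_i = 0.5 * (p_i + q_i)
        if m_i == 0:
            continue
        r_i = min(p_i, q_i) / m_i  # smaller probability over the average
        contributions[w] = m_i * c_of_r(r_i)
    return contributions

def jsd_direct(p, q):
    """Sum of Eq. (5) over all words, for comparison."""
    total = 0.0
    for w in set(p) | set(q):
        p_i, q_i = p.get(w, 0.0), q.get(w, 0.0)
        total += -xlog2x(0.5 * (p_i + q_i)) + 0.5 * (xlog2x(p_i) + xlog2x(q_i))
    return total

# Toy distributions, invented for illustration only.
p = {"the": 0.6, "war": 0.1, "peace": 0.3}
q = {"the": 0.6, "war": 0.4}
contrib = jsd_contributions(p, q)
print(contrib["the"])  # 0.0: unchanged probability, no contribution
print(math.isclose(sum(contrib.values()), jsd_direct(p, q)))  # True
```

In this toy case, “peace” vanishes in the second distribution, so its contribution equals its average probability, the maximum possible.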
We coarse-grain the data at the level of decades—e.g., 
between 1800-to-1809 and 1990-to-1999—by averaging 
the relative normalized frequency of each unique word in 
a given decade over all years in that decade. (Each year 
is weighted equally.) This allows convenient calculation 
and sorting of contributions to divergence of individual 
1-grams between any two time periods. 
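
The decade-level coarse-graining can be sketched as below. This is a minimal illustration, assuming per-year normalized frequencies are stored as dictionaries; `decade_average` is a hypothetical name, and the choices to skip years absent from the input and to treat a word missing in a given year as having frequency zero are assumptions of the sketch.

```python
from collections import defaultdict

def decade_average(freq_by_year, decade_start):
    """Average each word's normalized frequency over the years of a decade,
    weighting every available year equally."""
    years = [y for y in range(decade_start, decade_start + 10) if y in freq_by_year]
    if not years:
        raise ValueError(f"no data for the decade starting in {decade_start}")
    sums = defaultdict(float)
    for y in years:
        for word, f in freq_by_year[y].items():
            sums[word] += f  # a word absent in a year adds nothing for that year
    return {word: s / len(years) for word, s in sums.items()}

# Toy frequencies, invented for illustration only.
toy = {
    1940: {"the": 0.8, "war": 0.2},
    1941: {"the": 0.6, "war": 0.3, "radio": 0.1},
}
forties = decade_average(toy, 1940)
print(sorted(forties))  # ['radio', 'the', 'war']
```

Because each year's frequencies sum to one and the years are weighted equally, the averaged decade distribution also sums to one.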


FIG. 3: For the ratio r between the smaller relative probability 
of an element and the average, C(r) is the proportion 
of the average contributed to the Jensen-Shannon divergence 
(see Eqs. (6) and (7)). In particular, if r = 1 (no change), then 
the contribution is zero; if r = 0, the contribution is half its 
probability in the distribution in which it occurs with nonzero 
probability. [Axes: ratio of smaller to average frequency 
(horizontal); normalized contribution to divergence (vertical).] 

FIG. 4: Heatmaps showing the JSD between every pair of 
years between 1800 and 2000, contributed by words appearing 
above a normalized frequency threshold of $10^{-5}$. The 
dashed lines highlight the divergences to and from the year 
1880, which are featured in Fig. 5. The off-diagonal elements 
represent divergences between consecutive years, as in Fig. 6. 
The color represents the percentage of the maximum divergence 
observed in the given time range for each data set. The 
divergence between a year and itself is zero. For any given 
year, the divergence increases with the distance (number 
of years) from the diagonal, sharply at first, then gradually. 
Interesting features of the maps are the presence of two 
cross-hairs in the first half of the 20th century, which strongly 
suggests a wartime shift in the language, as well as an asymmetry 
that suggests a particularly high divergence between 
the first half century and the last quarter century observed. 

FIG. 5: JSD between 1880 and each displayed year for given 
data set, corresponding to dashed lines from Fig. 4. Contributions 
are counted for all words appearing above a $10^{-5}$ 
threshold in a given year; for the dashed curves, the threshold 
is $10^{-4}$. Typical behavior in each case consists of a relatively 
large jump between one year and the next with a more gradual 
rise afterward (in both directions). Exceptions include 
wartime, particularly the two World Wars, during which the 
divergence is greater than usual; however, after the conclusion 
of these periods, the cumulative divergence settles back 
to the previous trend. Initial spikiness in (D) is likely due to 
low volume. 


III. RESULTS AND DISCUSSION 

A. Broad view of language evolution within Google Books 

Fig. 4 shows the JSD between the 1-gram distributions 
for every pair of years between 1800 and 2000, contributed 
by 1-grams present above a threshold normalized 
frequency of $10^{-5}$ for both versions of the English 
and English Fiction data sets (i.e., words that appear 
with normalized frequency at least 1 in $10^{5}$). 

A major qualitative aspect apparent from the 
heatmaps is a gradual increase in divergence with differences 
in time (the lexicon underlying Google steadily 
evolves), though this is strongly curtailed for the second 
English Fiction corpus. We see the heatmaps are 
“pinched” toward the diagonal in the vicinities of the 
two world wars. Also visible is an asymmetry that suggests 
a particularly high divergence between the first 
half century and the last quarter century observed. We 
examine these effects more closely in Figs. 5 and 6 by 
taking two slices of the heatmaps. We specifically consider 
the divergences of each year compared with 1880 
(dashed lines), and the divergences between consecutive 
years (off-diagonal). To verify qualitative consistency, 
we also include analogous contribution curves using the 
more restrictive threshold of $10^{-4}$. 

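A thresholded pairwise divergence matrix of the kind described above might be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the toy frequencies are invented, and in particular the rule that a word counts when its frequency reaches the threshold in at least one of the two years is an assumption, since the text does not spell out how the threshold is applied to a pair of years.

```python
import math

def xlog2x(x):
    """x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0

def thresholded_jsd(p, q, threshold):
    """Sum of per-word JSD contributions (Eq. (5)) over words at or above
    `threshold` in at least one of the two years (assumed rule)."""
    total = 0.0
    for w in set(p) | set(q):
        p_i, q_i = p.get(w, 0.0), q.get(w, 0.0)
        if max(p_i, q_i) < threshold:
            continue
        total += -xlog2x(0.5 * (p_i + q_i)) + 0.5 * (xlog2x(p_i) + xlog2x(q_i))
    return total

def jsd_matrix(freq_by_year, threshold):
    """Return the sorted years and the matrix of pairwise thresholded JSDs."""
    years = sorted(freq_by_year)
    matrix = [
        [thresholded_jsd(freq_by_year[a], freq_by_year[b], threshold) for b in years]
        for a in years
    ]
    return years, matrix

# Toy frequencies, invented for illustration only.
toy = {
    1880: {"the": 0.70, "steam": 0.30},
    1881: {"the": 0.70, "steam": 0.29, "radio": 0.01},
    1940: {"the": 0.65, "steam": 0.05, "radio": 0.30},
}
years, matrix = jsd_matrix(toy, threshold=1e-5)
print(years)                        # [1880, 1881, 1940]
print(matrix[0][2] > matrix[0][1])  # True: 1940 is farther from 1880 than 1881 is
```

A row of this matrix corresponds to a slice such as the comparison of every year with 1880, and the entries just off the diagonal correspond to consecutive-year divergences.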
While the initial divergence between any two consecutive 
years is noticeable, the divergence increases (for the 
most part) steadily with the time difference. The cross-hairs 
from the heatmap resolve into wartime bumps in 
divergence, which quickly settle in peacetime. The larger 
boost to the divergence in recent decades, however, is 
more persistent, suggesting a more fundamental change in 
the data set, which we will examine in more depth later in 
this section. Divergences between consecutive years typically 
decline through the mid-19th century. Divergences 
then remain relatively steady until the mid-20th century, 
then continue to decline gradually over time, which may 
be consistent with previous findings of decreased rates of 
word introduction and increased rates of word obsolescence 
in many Google Books data sets over time [S] and 
a slowing down of linguistic evolution over time as the 
vocabulary of a language expands. The initial spikes 
in divergence in the second version of the fiction data 
set are likely due to the lower initial volume observed in 
Fig. 1. 

FIG. 6: Consecutive year (between each year and the following 
year) base-10 logarithms of JSD, corresponding to 
off-diagonals in Fig. 4. For the solid curves, contributions 
are counted for all words appearing above a $10^{-5}$ threshold 
in a given year; for the dashed curves, the threshold is 
$10^{-4}$. Divergences between consecutive years typically decline 
through the mid-19th century, remain relatively steady until 
the mid-20th century, then continue to decline gradually over 
time. 


B. Decade-decade comparisons using JSD word 
shifts 

1. General observations 

We present “word shifts” for a few examples of 
inter-decade divergences in Figs. 7-12, specifically comparing 
the 1940s to the 1930s and the 1980s to the 1950s 
for the second unfiltered English data set (Figs. 7 and 8) and 
both English Fiction data sets (Figs. 9-12). We provide 
a full set of such comparisons in the Supporting Information 
files S1, S2, S3, and S4. For each of the four data 
sets, the largest contributions to all divergences generally 
appear to be from increased relative frequencies of use 
of words between decades. For the unfiltered data sets, 
these are in turn heavily influenced by increased mention 
of years, which is less pronounced for English Fiction. 

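A word shift of the kind shown in the figures can be sketched as a ranked list of per-word contributions, each expressed as a percentage of the total JSD and signed by the direction of change. The function name `word_shift`, the sign convention (negative for words more common in the earlier decade), and the toy frequencies are assumptions made for illustration, not corpus values.

```python
import math

def xlog2x(x):
    """x * log2(x), with the convention 0 * log2(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0

def word_shift(p_early, p_late, top=60):
    """Return (total JSD in bits, ranked list of (word, signed % of total JSD))."""
    rows = []
    for w in set(p_early) | set(p_late):
        a, b = p_early.get(w, 0.0), p_late.get(w, 0.0)
        m = 0.5 * (a + b)
        contribution = -xlog2x(m) + 0.5 * (xlog2x(a) + xlog2x(b))  # Eq. (5)
        sign = 1.0 if b > a else -1.0  # negative: more common in the earlier decade
        rows.append((w, contribution, sign))
    total = sum(c for _, c, _ in rows)
    if total <= 0:
        return total, []
    rows.sort(key=lambda row: row[1], reverse=True)
    return total, [(w, sign * 100.0 * c / total) for w, c, sign in rows[:top]]

# Toy decade distributions, invented for illustration only.
early = {"the": 0.90, "King": 0.06, "war": 0.04}
late = {"the": 0.88, "war": 0.09, "Hitler": 0.03}
total, shifts = word_shift(early, late)
for word, percent in shifts:
    print(f"{word:>8s} {percent:+7.2f}%")
```

The magnitudes of the percentages sum to 100 when every word is included; truncating to the top 60 reports only the largest contributions, as in the figures.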
The 1940s literature, unsurprisingly, features more references 
to Hitler and war than the 1930s, along with 
other World War II-related military and political terms. 
This is seen regardless of the specific data set used and 
is fairly encouraging. Curiously, regardless of the specific 
data set, a noticeable contribution is given by an increase 
in relative use of the words “Lanny” and “Budd,” in reference 
to one character (Lanny Budd) frequently written 
about by Upton Sinclair during that decade. In the fiction 
data sets, this character dominates the charts. 

FIG. 7: (English, all; Version 2.) Top 60 individual contributions 
of 1-grams to the JSD between the 1930s and the 1940s. 
Each contribution is given as a percentage of the total JSD 
(see horizontal axis label) between the two given decades. All 
contributions are positive; bars to the left of center represent 
words that were more common in the earlier decade, whereas 
bars to the right represent words that became more common 
in the later decade. [Panel title: Top JSD contributions: 1930s 
to 1940s. Horizontal axis: Contribution (% of total JSD = 0.00364 bit).] 

FIG. 8: (English, all; Version 2.) Top 60 individual contributions 
of 1-grams to the JSD between the 1950s and the 1980s. 
Each contribution is given as a percentage of the total JSD 
(see horizontal axis label) between the two given decades. All 
contributions are positive; bars to the left of center represent 
words that were more common in the earlier decade, whereas 
bars to the right represent words that became more common 
in the later decade. [Panel title: Top JSD contributions: 1950s 
to 1980s. Horizontal axis: Contribution (% of total JSD = 0.0145 bit).] 

FIG. 9: (English Fiction, Version 1.) Top 60 individual 
contributions of 1-grams to the JSD between the 1930s and 
the 1940s. Each contribution is given as a percentage of the 
total JSD (see horizontal axis label) between the two given 
decades. All contributions are positive; bars to the left of 
center represent words that were more common in the earlier 
decade, whereas bars to the right represent words that became 
more common in the later decade. [Panel title: Top JSD 
contributions: 1930s to 1940s. Horizontal axis: Contribution 
(% of total JSD = 0.0045 bit).] 

FIG. 10: (English Fiction, Version 1.) Top 60 individual 
contributions of 1-grams to the JSD between the 1950s and 
the 1980s. Each contribution is given as a percentage of the 
total JSD (see horizontal axis label) between the two given 
decades. All contributions are positive; bars to the left of 
center represent words that were more common in the earlier 
decade, whereas bars to the right represent words that became 
more common in the later decade. [Panel title: Top JSD 
contributions: 1950s to 1980s. Horizontal axis: Contribution 
(% of total JSD = 0.0132 bit).] 

FIG. 11: (English Fiction, Version 2.) Top 60 individual 
contributions of 1-grams to the JSD between the 1930s and 
the 1940s. Each contribution is given as a percentage of the 
total JSD (see horizontal axis label) between the two given 
decades. All contributions are positive; bars to the left of 
center represent words that were more common in the earlier 
decade, whereas bars to the right represent words that became 
more common in the later decade. [Horizontal axis: Contribution 
(% of total JSD = 0.00443 bit).] 

FIG. 12: (English Fiction, Version 2.) Top 60 individual 
contributions of 1-grams to the JSD between the 1950s and 
the 1980s. Each contribution is given as a percentage of the 
total JSD (see horizontal axis label) between the two given 
decades (see title). All contributions are positive; bars to the 
left of center represent words that were more common in the 
earlier decade, whereas bars to the right represent words that 
became more common in the later decade. [Panel title: Top JSD 
contributions: 1950s to 1980s. Horizontal axis: Contribution 
(% of total JSD = 0.00779 bit).] 

2. Second unfiltered English data set: 1930s versus 1940s 

A comparison of the 1930s and 1940s for the second 
version of the unfiltered English data set (Fig. 7) shows 
dynamics dominated by references to years. (The first 
version is similar. For analogous figures see the SI files S1, 
S2, S3, and S4.) Eight of the top ten contributions to the 
divergence between those decades are due to increased 
relative frequencies of use of individual years between 1940 
and 1949, their contribution decreasing chronologically, 
and the other two top ten words are the last two years of 
the previous decade (“1948” and “1949” appear at ranks 
15 and 34, respectively). The last three years in the 1920s 
also appear by way of decreased relative frequency of use 
in the top 60 contributions. Other notable differences 
include: 

• The 11th highest contribution is from “war,” which 
increased in relative frequency. 

• “Hitler” and “Nazi” (increased relative frequencies) 
are ranked 18th and 26th, respectively. 

• Parentheses (13th and 14th) show increased relative 
frequencies of use. 

• Personal pronouns show decreased relative frequencies 
of use. 

• The word “King” (41st) also shows a decreased relative 
frequency, possibly due to the British line of 
succession. 


3. Second unfiltered English data set: 1950s versus 1980s 

• The top two contributions between the 1950s and 
the 1980s (see Fig. 8) in the English data set 
are both parentheses, which show dramatically 
increased relative frequencies of use. 

• Combined with increased relative frequencies for 
the colon (4th), solidus/virgule (or forward slash) 
(14th), “computer” (32nd), and square brackets 
(58th and 59th), this suggests that the primary 
changes between the 1950s and the 1980s are due 
specifically to computational sources. 

• Other technical words showing noticeable increases 
include “model” (34th), “data” (35th), “percent” 
and the percentage sign (37th and 39th), 
“Figure” (40th), “technology” (51st), and “information” 
(56th). 

• Similarly to the divergence between the 1930s and 
1940s, 19 out of the top 30 places are accounted 
for by increased relative frequencies of use in years 
between 1968 and 1980. 

• The words “the” (3rd), “of” (8th), and “which” 
(16th) all decrease noticeably in relative frequency 
and are the highest ranked alphabetical 1-grams. 

• Unlike the divergence between the 1930s and 1940s, 
only masculine pronouns show decreases in the top 
60, while “women” (55th) increases. 


4. First English fiction data set: 1930s versus 1940s 

The first version of English Fiction shows similar 
dynamics to the second version of the unfiltered data 
set between the 1930s and the 1940s (see Fig. 9) with 
yearly mentions dominating the ranks. Some exceptions 
include: 

• “Lanny” rising in rank from 49th to 8th. 

• Parentheses falling from 13th and 14th to 36th and 
37th. 

• “ml” (increased relative frequency of use in 
the 1940s) falling from 31st to 55th. 

• “radio” (with increased relative frequency) rising 
from 51st to 30th. 

• “King” is no longer in the top 60 contributions. 

• “patient” enters the top 60 (ranked 51st). 


5. First English fiction data set: 1950s versus 1980s 

This similarity between the original English Fiction 
data set and the unfiltered data set also appears in 
the divergence between the 1950s and the 1980s (see 
Fig. 10) with parentheses and years dominating. Moreover, 
“patients” ranks 13th (with increased relative frequency 
of use) despite not appearing in the top 60 for the 
unfiltered data set. These observations, combined with 
increases in “levels” (47th), “drug” (51st), “response” 
(55th), and “therapy” (56th) demonstrate the original 
fiction data set is strongly influenced by medical journals. 
Therefore, this data set cannot be considered as 
primarily fiction despite the label. 


6. Second English fiction data set: 1930s versus 1940s 


FIG. 13: Upton Sinclair wrote 11 Lanny Budd novels set 
during World War II. The first of these was published in 1940, 
and the last was published in 1953. The net effect of Sinclair’s 
efforts is that his character appears much more frequently in 
the English Fiction (Version 2) data set than Hitler during 
most of the war. This demonstrates the potential impact of 
a single prolific author on the corpus. 

FIG. 14: Time series for “he” and “she” for Version 2. 
The unfiltered normalized frequencies are given by the solid 
curve. Normalized frequencies in fiction are given by the 
dashed curve. These personal pronouns are more common 
in fiction. The pronoun “she” gains popularity through the 
1990s in both data sets, with a more pronounced growth in 
fiction. 

Fortunately, the same is not true for the second version 
of the English Fiction data set. This is quickly apparent 
upon inspection of the two greatest contributions to the 
divergence between the 1930s and the 1940s (see Fig. 11). 
The first of these is due to a dramatic increase in the relative 
frequencies of use of quotation marks, which implies 
increased dialogue. The second is the name “Lanny” in 
reference to the recurring character Lanny Budd from 
11 Upton Sinclair novels published between 1940 and 
1953. “Budd” ranks 11th in the chart ahead of “Hitler” 
(13th). The normalized frequency series for “Lanny” 
and “Hitler” provided in Fig. 13 demonstrate that Lanny 
received more mention than Hitler during this time 
period. The chart is littered with the names of fictional 
characters: 


• Studs Lonigan, the 1930s protagonist of a James T.
Farrell trilogy, secures the 12th spot. (Naturally, he
is mentioned fewer times during the 1940s.)

• Dinny Cherrel from the 1930s The Forsyte Saga by
John Galsworthy secures rank 15.

• Wang Yuan from the 1930s The House of Earth
trilogy by Pearl S. Buck ranks 17th and 37th.

• Detective Bill Weigand, a recurring character created
by Richard Lockridge in the 1940s, secures rank
33.

• The eponymous, original Asimov robot from the
1940 short story, “Robbie,” ranks 19th.

• “Mama” (ranked 48th) is none other than the subject
of Mama’s Bank Account, published in 1943 by
Kathryn Forbes.

• “Saburov” (ranked 22nd) from Days and Nights
by Konstantin Simonov and “Diederich” (ranked
45th) from Der Untertan by Heinrich Mann are
subjects of works translated into English in the
1940s.

We note that Marcel Proust (56th and 33rd), who died
in 1922, may be present in the 1940s due to letters
translated by Mina Curtiss in 1949 or other references
that are not technically fiction. Similarly, “B.M.” (18th)
may refer to the author B. M. Bower. Thus, the vast
majority of prominent words in the word shift may be
traced not only to authors of fiction, but to the content
of their work. Moreover, the greatest contributions
to divergence appear to correspond to the most prolific
authors, particularly Upton Sinclair.


7. Second English fiction data set: 1950s versus 1980s


While there are no names of characters in the top
divergences between the 1950s and the 1980s, the updated
fiction data set (Fig. 12) displays far more variety than the
original version, including:


• Decreases in relative frequencies of masculine
pronouns, e.g., “he” (rank 19) and “himself”
(rank 48), and corresponding increases for feminine
pronouns, e.g., “her” (3rd), “she” (5th), and
“She” (6th). We present time series for “he” and
“she” in Fig. 14.

• An increase in relative frequencies of contractions
(see ranks 9, 15, and 21).

• A decrease in “shall” (16th) and “must” (49th),
and a variety of increased profanity (particularly
ranks 33 and 51).

• Decreases in “Mr.” (10th) and “Mrs.” (17th).

• Various shifts in punctuation, particularly fewer
semicolons (1st) and more periods (2nd). Quotation
(11th) and question (18th) marks both see
increased relative frequencies of use in the 1980s,
and the four-period ellipsis (20th) loses ground to
the three-period version (22nd).


FIG. 15: Time series of technical terms from Version 2: (a)
English all, (b) English fiction. In the unfiltered data set,
these technical terms appear frequently and increase in usage
through the 1980s. In fiction, technical terms show up far
less frequently and remain relatively stable in usage with the
notable exception of “computer,” which has been gradually
gaining popularity since the 1960s.


C. The rise of scientific literature in the Google 
Books corpus 


As our JSD analysis has shown above, the unfiltered
English data sets feature more general scientific terms,
and we compare “percent,” “data,” “Figure,” and “model”
in Fig. 15. The original fiction data set also features
these, but additionally places “patients,” “drug,”
“response,” and “therapy” among the top 60 contributions.
The primary difference between the unfiltered and original
fiction data sets in the 1980s (compared to the 1950s)
appears to consist of the nature of journals sampled. The
unfiltered components predicted and observed for this
particular data set seem to be dominated by medical journals.
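
As a rough illustration of how ranked contributions to a divergence between two decades can be obtained, the following Python sketch computes each word's term in the Jensen-Shannon divergence. It assumes the standard equal-weight definition with base-2 logarithms, which may differ in detail from the analysis reported here. The function names and the toy counts are hypothetical and are not drawn from the corpus.

```python
import math


def normalize(counts):
    """Convert raw counts into relative frequencies that sum to 1."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def _plogp(x):
    """Return x * log2(x), using the convention 0 * log(0) = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def jsd_contributions(counts_p, counts_q):
    """Map each word to (contribution in bits, change in relative frequency).

    Assumes the equal-weight mixture m = (p + q) / 2, so that the
    contributions sum to H(m) - (H(p) + H(q)) / 2.
    """
    p, q = normalize(counts_p), normalize(counts_q)
    result = {}
    for word in set(p) | set(q):
        p_i, q_i = p.get(word, 0.0), q.get(word, 0.0)
        m_i = 0.5 * (p_i + q_i)
        contribution = -_plogp(m_i) + 0.5 * (_plogp(p_i) + _plogp(q_i))
        result[word] = (contribution, q_i - p_i)
    return result


def top_contributions(counts_p, counts_q, k=60):
    """Return the k words with the largest contributions, largest first."""
    ranked = sorted(jsd_contributions(counts_p, counts_q).items(),
                    key=lambda item: item[1][0], reverse=True)
    return ranked[:k]


# Toy counts, invented purely for illustration (not corpus data).
earlier = {"the": 900, "shall": 60, ";": 40}
later = {"the": 900, "shall": 20, ";": 10, "computer": 70}
for word, (bits, delta) in top_contributions(earlier, later, k=3):
    direction = "up" if delta > 0 else "down"
    print(f"{word!r}: {bits:.5f} bits ({direction})")
```

Keeping the sign of the frequency change alongside each contribution is what allows a ranked list to distinguish words that rise from words that decline between the two decades.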

As well as having more mentions of time and technical
terms (and parentheses) in the 1980s than in the
1950s, both unfiltered versions and the first fiction data
set include both “et” and “al” with greater relative
frequency in the 1980s. Perhaps more importantly, years
do not have a large effect on the dynamics in the second
English Fiction data set. We see in Fig. 16 that while
peaks for years rise in the unfiltered data, they do not
in fiction. The absence of rising peaks in fiction strongly
suggests that the rise in peak relative frequencies of years
in the larger data set is due to a citation bias in the
unfiltered data set from high sampling of scientific journals.
This bias casts strong doubt on conclusions that we as
a culture forget things more quickly than we once did
based on the observation that half-lives for mentions of
a given year decline over time [1].


FIG. 16: Normalized frequencies of references to years. The
top panel resembles a figure from [1] using unfiltered data
from English Version 2. (The cited paper uses Version 1.)
Note the characteristic rapid rises and gradual declines, as
well as the increasing peaks in yearly references. However,
while the characteristic shape is still present in fiction (Version
2, bottom), at much reduced levels, the peaks do not rise.
The rising effect is likely due to citations from scientific texts.


The exponential rise in scientific literature is not a new
phenomenon, and as de Solla Price stated in 1963 [17] (p.
81) when discussing the half-lives for citations of scientific
literature, “In fields embarrassed by an inundation of
literature there will be a tendency to bury as much of the
past as possible and to cite older papers less often than
is their statistical due.” It would seem that an explanation
for declining half-lives in the mentions of years lies in the
dynamics of the memory of scientific discoveries rather
than that of culture.
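
To make the notion of a half-life for mentions of a year concrete, the following Python sketch estimates it from a normalized frequency series. It assumes a simple working definition, namely the number of years after the peak until the frequency first falls to half the peak value or below; the definition used in [1] may differ. The function name and the example series are hypothetical.

```python
def half_life(series):
    """Estimate the half-life of a frequency series.

    `series` is a list of (calendar_year, normalized_frequency) pairs
    sorted by calendar year. Returns the number of years between the
    peak and the first later point at or below half the peak value,
    or None if the series never decays that far.
    """
    peak_index = max(range(len(series)), key=lambda i: series[i][1])
    peak_year, peak_freq = series[peak_index]
    for year, freq in series[peak_index + 1:]:
        if freq <= peak_freq / 2:
            return year - peak_year
    return None


# Synthetic series for mentions of a single year (illustrative values only).
mentions = [(1950, 1.0), (1951, 4.0), (1952, 3.0), (1953, 2.0), (1954, 1.0)]
print(half_life(mentions))
```

Under this definition, a citation-heavy data set whose references concentrate on recent years would show shorter half-lives even if nothing about cultural memory had changed, which is the concern raised above.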

For the second fiction data set, we observe in Fig. 15
that “computer” gains popularity in the fiction data set
despite other technical words remaining relatively steady
in usage, as we might expect. This should be encouraging
for anyone attempting to analyze colloquial English,
despite the prolificacy bias apparent from authors
such as Upton Sinclair.

In the SI files S1, S2, S3, and S4, we include the top
60 contributions to divergences between each pair of the
20 decades in each of the four data sets analyzed in this
paper. In total, 760 figures are included (190 per data
set) for a grand total of 45,600 contributions. We highlight
some of these here.


• For divergences to and from the first decade of
the 1800s, many of the contributions are due to a
reduction of optical character recognition confusion
between the letters ‘f’ and ‘s’. For example, in the
second unfiltered data set between the 1800s and
1810s, the top two contributions are due to reductions
in “fame” and “os,” respectively. The word
“same” (ranked 11th) is the first increasing
contribution. Decreased relative frequencies of “os,”
“sirst,” “thofe,” “fo,” “fay,” “cafe,” “fays,” “fome,”
and “faid” (ranks 3 through 10, respectively) and
“lise” (12th) all suggest digital misreadings of both
‘f’ and the long ‘s’. (The 13th contribution is
“Napoleon,” who is mentioned with greater relative
frequency in the 1810s.)

• Contributions between the 1830s and the 1860s in
the second unfiltered data set highlight the American
Civil War and its aftermath. “State” (11th),
“General” (19th), “States” (20th), “Union” (37th),
“Confederate” (48th), “Government” (52nd), “Federal”
(56th), and “Constitution” (59th) all show
increased relative frequency of use. Religious terms
tend to decline during this period, e.g., “church”
(14th), “God” (24th), and “religion” (58th).

• Between the 1940s and 1960s, the second unfiltered
data set shows increases for “nuclear” (43rd),
“Vietnam” (47th), and “Communist” (50th). The
relative frequency of “war” (25th) decreases
substantially. Meanwhile in fiction, “Fanny” (5th)
declines, while “television” (38th) and the Hardy
Boys (“Hardy” ranks 51st) appear with greater
relative frequencies.


• Between the 1960s and 1970s, the second fiction
data set is strongly affected by “Garp” (The World
According to Garp by John Irving, 1978) at rank 19,
increased relative frequencies of profanity (ranks
27, 33, and 38), and increased mentions of “Nixon”
(41st) and “Spock” (47th, likely due to “Star Trek”
novels).


• Between the 1980s and 1990s, the second fiction
set shows increased relative frequencies of use of
the words “gay” (15th), “lesbian” (19th), “AIDS”
(24th), and “gender” (27th). Female pronouns
(2nd, 8th, and 9th) show increased relative frequencies
of use in continuance of Fig. 12.
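
The figure and contribution counts quoted for the SI files follow from simple combinatorics, which the short Python sketch below reproduces. It assumes that the 20 decades are the 1800s through the 1990s; the variable names are ours and purely illustrative.

```python
from itertools import combinations

# Assumption: the 20 decades run from the 1800s through the 1990s.
decades = list(range(1800, 2000, 10))
decade_pairs = list(combinations(decades, 2))  # unordered pairs of decades

n_data_sets = 4           # two unfiltered and two fiction data sets
top_k = 60                # contributions shown per figure

figures_per_data_set = len(decade_pairs)
total_figures = figures_per_data_set * n_data_sets
total_contributions = total_figures * top_k

print(len(decades), figures_per_data_set, total_figures, total_contributions)
```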


IV. CONCLUDING REMARKS 


Based on our introductory remarks and ensuing
detailed analysis, it should now be clear that the contents
of the Google Books corpus do not represent an
unbiased sampling of publications. Beyond being
library-like, the evolution of the corpus throughout the 1900s is
increasingly dominated by scientific publications rather
than popular works. We have shown that even the first
data set specifically labeled as fiction appears to be
saturated with medical literature.

When examining these data sets in the future, it will
therefore be necessary to first identify and distinguish
the popular and scientific components in order to form
a picture of the corpus that is informative about cultural
and linguistic evolution. For instance, one should
ask how much of any observed gender shift in language
reflects word choice in popular works and how much is
due to changes in scientific norms, as well as which might
precede the other if they are somewhat in balance.

Even if we are able to restrict our focus to popular
works by appropriately filtering scientific terms, the
library-like nature of the Google Books corpus will mean
the resultant normalized frequencies of words cannot be a
direct measure of the “true” cultural popularity of those
words as they are read (again, Frodo). Secondarily, not
only will there be a delay between changes in the public
popularity of words and their appearance in print, but
normalized frequencies will also be affected by the
prolificacy of the authors. In the case of Upton Sinclair’s
Lanny Budd, a fictional character was vaulted to the upper
echelons of words affecting divergence (even surpassing
Hitler) by virtue of appearing as the protagonist in 11
novels between 1940 and 1953. Google Books is at best
a limited proxy for social information after the fact.

The Google Books corpus’s beguiling power to immediately
quantify a vast range of linguistic trends warrants
a very cautious approach to any effort to extract
scientifically meaningful results. Our analysis provides a
possible framework for improvements to previous and future
works which, if performed on English data, ought to focus
solely on the second version of the English Fiction data
set, or otherwise properly account for the biases of the
unfiltered corpus.